Final Project

Final
Author

Jamie, Tyler, Zion

Published

March 22, 2024

URL to the blog post:

https://jamie1130.github.io/PIC-16B/posts/Final Project/

Project Overview

It’s easy to find information about hiking trails throughout the United States online: sites like trailforks.com let you browse lists of trails and filter by location. But that only works if you already know where you want to go. What are you supposed to do if you don’t know much about an area but still want to hike there, or if you know what kind of hike you want but not where to find it? You could do lots of research yourself, reading articles or sifting through different locations, but that process is slow and difficult, especially for places you’ve never been.

Our project recommends new places to visit based on trails the user has previously hiked and enjoyed. For instance, if the user previously hiked Half Dome in Yosemite, we give the user the other trails and locations in the United States most similar to Half Dome according to reviews on TripAdvisor. This is aimed mainly at tourists who want to visit parts of the country they have not been to.

Let’s have a quick visual overview of the structure of the project.

image.png

Webscraping TripAdvisor - Tyler

import cv2 as cv
import matplotlib.pyplot as plt
import pandas as pd

Here we scrape TripAdvisor to gather reviews for semantic analysis, so that we can recommend trails based on how similar their reviews are to each other. We do so using scrapy!

This is what the first page looks like:

img = cv.imread('/content/tylercap/Capture.JPG')
plt.imshow(img)

The first function we write is parse. This function follows the link from a national park’s general page to the page listing just the activities to do there.

def parse(self, response):
  next_page = response.xpath("//div//a[contains(@href, 'Attraction')]//@href").get() #xpath command to redirect to activities
  yield response.follow(next_page, callback = self.parse_full_credits) #redirects to the new url page and executes parse_full_credits

The second function we write is parse_full_credits. This function follows the links from the list of all activities to each individual trail page, so that we can extract reviews and information for each individual trail in a national park.

This is what the second page looks like:

img = cv.imread('/content/tylercap/Capture2.JPG')
plt.imshow(img)

def parse_full_credits(self, response):
  trail_page = response.xpath("//div[@class = 'BYvbL A']//a[@class = 'BUupS _R w _Z y M0 B0 Gm wSSLS']//@href").getall()
  for trail in trail_page: #for every trail in the trial page list url, execute the callback command parse_actor_page
      yield response.follow(trail, callback = self.parse_trail)

Finally, the last function we write is parse_trail. This function yields the data we care about: the national park’s name, the state it is in, the trail name, the overall trail rating, and each comment’s title, text, and rating.

def parse_trail(self, response):
  national_park = response.xpath("//span[@class = 'fxMOE']//text()").get()
  state = response.xpath("//span[@class = 'n q']//span[@class = 'biGQs _P pZUbB avBIb osNWb']//text()").getall()[1]
  trail = response.xpath("//h1//text()").get()
  comment_title = response.xpath("//div[@class = 'LbPSX']//div[@class = 'biGQs _P fiohW qWPrE ncFvv fOtGX']//span//text()").getall()
  ratings = response.xpath("//div[@class = 'LbPSX']//svg[@class = 'UctUV d H0']//title//text()").getall()
  comment_text = response.xpath("//div[@class = 'LbPSX']//span[@class = 'JguWG']//span[@class = 'yCeTE']//text()[1]").getall()
  pictures = response.xpath("//div[@class = 'LbPSX']//span[@class = 'biGQs _P XWJSj Wb']//img//@srcset").getall()
  overall_rating = response.xpath("//div[@class = 'biGQs _P fiohW hzzSG uuBRH']//text()").get()
  trail_type = response.xpath("//div[@class = 'biGQs _P pZUbB alXOW oCpZu GzNcM nvOhm UTQMg ZTpaU W KxBGd']//span//text()").get()
  for ix in range(len(comment_title)):
      yield {
          "national_park" : national_park,
          "state" : state,
          "trail":trail,
          "activity": trail_type,
          "overall_rating": overall_rating,
          "comment_title":comment_title[ix],
          "comment_ratings":ratings[ix],
          "comment_text":comment_text[ix]
      }

This is what the trail page with reviews looks like:

img = cv.imread('/content/tylercap/Capture3.JPG')
plt.imshow(img)

Now, we want to get this data in csv format. To do so, we go to the directory which holds our spider and run scrapy crawl trip_advisor -o national_parks.csv. Great, now we can analyze our reviews!

Review Similarity Trail Recommender - Tyler

Now, we use word embeddings to return the most similar trails and their locations in the United States, based on the csv file we just created with our webscraper!

Firstly, let us import the packages we need. en_core_web_lg is a large (~590 MB) spaCy model with roughly 514 thousand unique word vectors, each 300 dimensions long. We use it so we can take advantage of pretrained word vectors instead of having to train our own.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import gensim
import spacy
!python -m spacy download en_core_web_lg
Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 2.9 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
nlp = spacy.load('en_core_web_lg')

Load Data

Now let us load the data we scraped from TripAdvisor, as well as an Excel file containing the coordinates of our national parks, so that we can create a geographical plot later.

df = pd.read_csv('https://raw.githubusercontent.com/torwar02/trails/main/trails/national_parks.csv')
df2 = pd.read_excel('https://raw.githubusercontent.com/torwar02/trails/main/trails/coords.xlsx')
df.head()
national_park state trail activity overall_rating comment_title comment_ratings comment_text
0 Acadia National Park Maine (ME) Beech Mountain Trail Hiking Trails 4.5 Turned back on 3/20/21 due to ice 4.0 of 5 bubbles I have hiked to the fire tower a few times. It...
1 Acadia National Park Maine (ME) Beech Mountain Trail Hiking Trails 4.5 Spectacular 5.0 of 5 bubbles This trail was recommended in my Acadia travel...
2 Acadia National Park Maine (ME) Beech Mountain Trail Hiking Trails 4.5 Great Trail 5.0 of 5 bubbles Beech Mountain Trail is one of my favorites in...
3 Acadia National Park Maine (ME) Beech Mountain Trail Hiking Trails 4.5 Best trail in Acadia 5.0 of 5 bubbles We stumbled onto this trail and were very happ...
4 Acadia National Park Maine (ME) Beech Mountain Trail Hiking Trails 4.5 Great trail for family 5.0 of 5 bubbles My family has kids ranging from age 10 to 3. W...
df2.head()
Latitude Longitude Park State(s) Park Established Area Visitors (2018)
0 44.35 -68.21 Acadia Maine February 26, 1919 49,075.26 acres (198.6 km2) 3537575
1 -14.25 -170.68 American Samoa American Samoa October 31, 1988 8,256.67 acres (33.4 km2) 28626
2 38.68 -109.57 Arches Utah November 12, 1971 76,678.98 acres (310.3 km2) 1663557
3 43.75 -102.50 Badlands South Dakota November 10, 1978 242,755.94 acres (982.4 km2) 1008942
4 29.25 -103.25 Big Bend Texas June 12, 1944 801,163.21 acres (3,242.2 km2) 440091

To merge the two files together, we use a regex to extract the string preceding ‘National Park’ in df, so that we can merge with df2 on the park name.

import re
pattern = r'(.*?)(?:\s+National Park)?$' #capture everything before an optional trailing " National Park"
park = []
for row in df['national_park']:
  park.append(re.findall(pattern, row)[0]) #first match is the bare park name
df['park'] = park
national_parks = pd.merge(df, df2, left_on='park', right_on='Park')
national_parks = national_parks.drop(columns = ['park', 'Park', 'State(s)', 'Park Established'])
national_parks.head()
national_park state trail activity overall_rating comment_title comment_ratings comment_text Latitude Longitude Area Visitors (2018)
0 Acadia National Park Maine (ME) Beech Mountain Trail Hiking Trails 4.5 Turned back on 3/20/21 due to ice 4.0 of 5 bubbles I have hiked to the fire tower a few times. It... 44.35 -68.21 49,075.26 acres (198.6 km2) 3537575
1 Acadia National Park Maine (ME) Beech Mountain Trail Hiking Trails 4.5 Spectacular 5.0 of 5 bubbles This trail was recommended in my Acadia travel... 44.35 -68.21 49,075.26 acres (198.6 km2) 3537575
2 Acadia National Park Maine (ME) Beech Mountain Trail Hiking Trails 4.5 Great Trail 5.0 of 5 bubbles Beech Mountain Trail is one of my favorites in... 44.35 -68.21 49,075.26 acres (198.6 km2) 3537575
3 Acadia National Park Maine (ME) Beech Mountain Trail Hiking Trails 4.5 Best trail in Acadia 5.0 of 5 bubbles We stumbled onto this trail and were very happ... 44.35 -68.21 49,075.26 acres (198.6 km2) 3537575
4 Acadia National Park Maine (ME) Beech Mountain Trail Hiking Trails 4.5 Great trail for family 5.0 of 5 bubbles My family has kids ranging from age 10 to 3. W... 44.35 -68.21 49,075.26 acres (198.6 km2) 3537575
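The same park-name cleanup can also be done without an explicit Python loop. Here is a hedged sketch using pandas string methods on a toy frame (the two small DataFrames below are invented stand-ins for the real scraped data and coordinates sheet):

```python
import pandas as pd

# Toy stand-ins for the scraped reviews and the coordinates sheet.
df = pd.DataFrame({"national_park": ["Acadia National Park", "Arches National Park"]})
df2 = pd.DataFrame({"Park": ["Acadia", "Arches"], "Latitude": [44.35, 38.68]})

# Strip a trailing " National Park" in one vectorized call instead of a loop.
df["park"] = df["national_park"].str.replace(r"\s+National Park$", "", regex=True)
merged = pd.merge(df, df2, left_on="park", right_on="Park")
print(merged[["national_park", "Latitude"]])
```

Both approaches produce the same merge key; the vectorized version just avoids building the intermediate list by hand.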

Word Embedding and Comment Similarity Score

First let us go over what word embedding is. Word embedding is an NLP technique that represents words as real-valued vectors, so that words with similar meanings get similar vector representations. These vectors preserve semantic and syntactic information, and they can be fed directly into machine learning models, letting NLP algorithms process text numerically rather than as raw strings.
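Under the hood, “similar vector representations” are usually compared with cosine similarity, which is also what spaCy’s .similarity() computes over document vectors. A minimal illustration with hand-made 3-dimensional vectors (the numbers are invented purely for the example; real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings": hike and trail point roughly the same way, airport does not.
hike    = [0.9, 0.8, 0.1]
trail   = [0.8, 0.9, 0.2]
airport = [0.1, 0.2, 0.9]

print(cosine_similarity(hike, trail))    # close to 1
print(cosine_similarity(hike, airport))  # much smaller
```

This is why two reviews that use related vocabulary end up with a high similarity score even when they share no exact words.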

Comment Similarity Function

Now let us create a function called comment_similarity, which takes the merged national parks dataframe via the parks_data parameter, a comment_index parameter, and an all_comments parameter holding the word-embedding vector representation of every comment in our csv file.

all_docs = [nlp(row) for row in national_parks['comment_text']] #vector representation of every comment in our csv file
def comment_similarity(parks_data, comment_index, all_comments):
  example_comment = parks_data.loc[comment_index, 'comment_text']
  reference_comment = nlp(example_comment) #vectorize our reference comment
  similarity_score = []
  row_id = []
  for i in range(len(all_comments)):
    similarity_score.append(all_comments[i].similarity(reference_comment))
    row_id.append(i)
  similarity_docs = pd.DataFrame(list(zip(row_id, similarity_score)), columns = ['Comment_ID', 'sims'])
  similarity_docs_sorted = similarity_docs.sort_values(by = 'sims', ascending = False)
  most_similar_comments = similarity_docs_sorted['Comment_ID'][1:2] #skip position 0: the most similar comment is the reference itself
  new_reviews = parks_data.iloc[most_similar_comments.values]
  return(new_reviews)

Now let us show what the returned dataframe looks like.

showcase = comment_similarity(national_parks, 0, all_docs)
showcase
national_park state trail activity overall_rating comment_title comment_ratings comment_text Latitude Longitude Area Visitors (2018)
1552 Grand Canyon National Park Arizona (AZ) Grand Canyon South Rim Canyons 5.0 The views do not disappoint! 5.0 of 5 bubbles We were staying with family in Sun City (near ... 36.06 -112.14 1,201,647.03 acres (4,862.9 km2) 6380495

As we can see, we get back a dataframe containing the review most similar to the review at index 0. To see how similar it really is, let us output both comments.

First the original comment

example_comment = national_parks.loc[0, 'comment_text']
example_comment
"I have hiked to the fire tower a few times. Its a great hike, and not too strenuous elevation gains.  If the NO rangers are up there ( in the summer) they used to allow you to go up the tower. We had to turn back on 3/20 because of hard pack solid ice. We had our Katoohla micro spikes on, and solid hiking poles, and knew they simply  wouldn't be enough if the ice was on the steeper sections.  We walked into the trailhead because the access road gate is still closed. After deciding to cross the lot and hike Beech Cliff Loop, which was much more clear of ice, and has excellent views of Echo Lake and the ocean out toward  Southwest Harbor. We returned to BH to hear of the recovery of a young couple from Rutland Massachusetts  who had fallen 100 feet to their death on Dorr Mountain Gorge Trail. The tragedy attributed to ice on the trails. Anyone not experienced with full crampon travel, and ice climbing training should never attempt to hike or climb on solid ice. The danger is severe.. "

Now the similar comment.

showcase['comment_text'].iloc[0]
'We were staying with family in Sun City (near the Phoenix airport) and drove in our rental vehicle the approximate 3.5 hour drive to the south entrance of the Grand Canyon.  The park entrance was easy to find.  Parking this year was $35/vehicle.  I was skeptical going in, as several friends had this excursion on their "bucket list" while others simply raved.  I worried I would be disappointed.  However, the views absolutely spectacular!  We self-guided/toured.  We both experienced some vertigo and were careful to hang on to the railings provided, or sit on available benches as needed. Also bring water.  With the high elevation, it is easier to get winded, and water helps. We did have a hiker in front of us fall a few times from experiencing vertigo,and with assistance from others were able to help him get off the stairs and onto level ground to sit down.  He was embarrassed but grateful.  It could (and does) happen to anyone.  There were some areas that were roped off due to ice and snow and I was amazed how many people stupidly ignored the warnings and bypassed the barriers to get closer to the edge of the Canyon for selfies!   Check the weather in advance and dress appropriately.  The temperature was 30 degrees cooler in the Canyon than in the Phoenix area.  There were many families present and some pushing young ones in strollers.  On Feb 10, it was a chilly, windy, 40 degrees F.   There are lots of signs at various points educating you on the history of rocks, the Colorado river running through the Canyon, etc., and a small museum you can enter about 1.5 hours into the walk.  After our hike, we were exhausted and wind blown, and caught a shuttle back to the parking lot.  Kudos to those who can manage to walk the entire thing.  We didn\'t see everything the south side had to offer.  In our vehicle, we exited the park from the east side and for some 50+ miles, still saw the Grand Canyon from out the driver\'s side window. 
There were several spots along the way to stop and take more photos.  All in all, it was a physically and mentally stimulating journey that I highly recommend.'

As we can see the comments are very similar! They both talk about the dangers of the trail and how they both saw people fall.

Total Trail Similarity

Now let us create a function called total_similarity, which takes the same parameters as our last function except that it takes a trail name instead of a comment_index. We do this because each trail has 10 scraped comments, and total_similarity calls comment_similarity once for each of them to find that comment’s most similar counterpart. As a result, we get 10 similar trails returned to us.

def total_similarity(trail, parks_data, all_comments):
  trail_subset = parks_data[parks_data['trail'] == trail].index #row indices of every comment on this trail
  total_df = []
  for number in trail_subset:
    total_df.append(comment_similarity(parks_data, number, all_comments))
  df = pd.concat(total_df)
  return(df)
output = total_similarity("Landscape Arch", national_parks, all_docs)
output
national_park state trail activity overall_rating comment_title comment_ratings comment_text Latitude Longitude Area Visitors (2018)
303 Badlands National Park South Dakota (SD) Pinnacles Overlook Points of Interest & Landmarks 5.0 Must See Pullover 5.0 of 5 bubbles This is one of a handful of overlooks you have... 43.75 -102.50 242,755.94 acres (982.4 km2) 1008942
235 Arches National Park Utah (UT) Delicate Arch Points of Interest & Landmarks 5.0 Delicate Arch 5.0 of 5 bubbles Our family chose to hike to Delicate Arch late... 38.68 -109.57 76,678.98 acres (310.3 km2) 1663557
863 Capitol Reef National Park Utah (UT) Capitol Reef National Park National Parks 4.5 Add Capitol Reef to Your Utah National Park List 5.0 of 5 bubbles Just to the northeast of more popular parks Br... 38.20 -111.17 241,904.50 acres (979.0 km2) 1227627
1310 Death Valley National Park California (CA) Zabriskie Point Geologic Formations 4.5 The Most Iconic Place in Death Valley 4.0 of 5 bubbles You can't miss it. I don't mean you have to do... 36.24 -116.82 3,373,063.14 acres (13,650.3 km2) 1678660
1611 Grand Teton National Park Wyoming (WY) Taggart Lake Hiking Trails 5.0 Do this hike if you want to feel like you're a... 5.0 of 5 bubbles It's not a difficult hike and is right off the... 43.73 -110.80 310,044.22 acres (1,254.7 km2) 3491151
222 Arches National Park Utah (UT) Double Arch Hiking Trails 5.0 Easy hike 5.0 of 5 bubbles The Double Arch is unreal. It is massive and b... 38.68 -109.57 76,678.98 acres (310.3 km2) 1663557
3198 Mount Rainier National Park Washington (WA) Sunrise Visitor Center Visitor Centers 4.5 Amazing views 5.0 of 5 bubbles Amazing hikes of all varieties. Many travel up... 46.85 -121.75 236,381.64 acres (956.6 km2) 1518491
1439 Glacier National Park Montana (MT) Grinnell Glacier Hiking Trails 5.0 Incredible vies and the end-point is rewarding 5.0 of 5 bubbles This 13 mile hike from Many Glacier to upper G... 48.80 -114.00 1,013,125.99 acres (4,100.0 km2) 2965309
1366 Glacier National Park Montana (MT) Virginia Falls Waterfalls 5.0 Magnificent Falls in Glacier National Park - w... 5.0 of 5 bubbles This is the second falls on a hike in Glacier ... 48.80 -114.00 1,013,125.99 acres (4,100.0 km2) 2965309
650 Canyonlands National Park Utah (UT) Horseshoe Canyon Canyons 5.0 WHOA! READ PLEASE. Things you NEED to know a... 5.0 of 5 bubbles There are some older reviews. Some are VERY M... 38.20 -109.93 337,597.83 acres (1,366.2 km2) 739449

As we can see, we get 10 trails similar to our desired trail, Landscape Arch.

Plotly Function

Now let us construct a geographical plot function called plotting_parks to get the location of these trails on a map. This is so that the user can better visualize where in the United States they may have to travel to. The function also analyzes other metrics from national_parks.csv such as visitors in 2018, type of activity, trail name, and overall TripAdvisor rating. This function calls total_similarity in order to get the dataframe with the most similar reviews!

from plotly import express as px
import plotly.io as pio
import inspect
pio.renderers.default="iframe"
def plotting_parks(trail, parks_data, all_comments, **kwargs):
  output = total_similarity(trail, parks_data, all_comments)
  fig = px.scatter_mapbox(output, lon = "Longitude", lat = "Latitude", color = "overall_rating",
                        color_continuous_midpoint = 2.5, hover_name = "national_park", height = 600,
                        hover_data = ["Visitors (2018)", "activity", "trail", "overall_rating"],
                        title = "Recommended National Park Trails",
                        size_max=50,
                        **kwargs,
                        )
  return fig
color_map = px.colors.diverging.RdGy_r # produce a color map
fig = plotting_parks("Landscape Arch", national_parks, all_docs, mapbox_style="carto-positron",
                                   color_continuous_scale = color_map)
fig.show()

image.png

Great, as we can see, we get a geo plot of the most similar National Park trails in the United States to Landscape Arch!

Webscraping TrailForks - Zion

Our original plan was to use a website called AllTrails, which contains very comprehensive information about different hiking trails. However, they beefed up their security measures several years ago to prevent people from scraping the site. Because of this, we turned to a different site called TrailForks which, while still able to block scrapy, is unable to block Selenium.

How do you get started with Selenium?

Selenium is able to evade certain anti-bot measures by using an actual instance of a web browser (called a webdriver) that runs on your system while scraping. In fact, once you get the scraper to work, you can watch it run in real time! Unfortunately, that also makes it a lot slower than scrapy, because your computer has to actually open every page. I used a Google Chrome webdriver. Below is (part of) the head of scraper.py, the script which scrapes data from individual trail pages on TrailForks.

from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import pandas as pd
from selenium import webdriver


options = webdriver.ChromeOptions()
options.add_experimental_option(
        "prefs", {
            # block image loading
            "profile.managed_default_content_settings.images": 2,
            # block javascript
            "profile.managed_default_content_settings.javascript": 2
        }
    )
options.add_argument('--no-sandbox')
options.add_argument('--headless')
service = Service() #a path to chromedriver can be passed here if it is not on PATH
driver = webdriver.Chrome(
        service=service,
        options=options
    )

Of note are the experimental options under prefs, which block both images and javascript content from loading on a website. When I first made the scraper, I did not have these enabled, and as a result, pages would sometimes take between 3 and 5 seconds to load (way too long!). Similarly, the --headless argument also makes pages load faster by disabling certain Google Chrome functionalities: https://www.selenium.dev/blog/2023/headless-is-going-away/

How do you scrape using Selenium?

At its core, Selenium isn’t that different from scrapy in that you have a scraper download HTML code which you can then filter through in Python as more familiar objects. Both scraper.py and scraper_parks.py (which filters through park-related information on TrailForks) follow the same general steps:

1. Look at what state a user has inputted
2. Get links to all of the park/trail pages for that state
3. Get corresponding information from each page, using helper functions if need be
4. Add data to SQL database (see next section for that!)
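The general control flow can be sketched as a dry-run skeleton. The helper names below are hypothetical stand-ins for the real Selenium calls in scraper.py, and the returned data is fake, so this only illustrates the shape of the pipeline, not the actual scraper:

```python
# Dry-run skeleton of the scraping pipeline; fetch/extract helpers are
# invented stand-ins for the real Selenium logic in scraper.py.
def get_trail_links(state):
    # Step 2: would use driver.get + find_elements on the state's listing pages.
    return [f"https://www.trailforks.com/trails/{state}-trail-{i}" for i in range(3)]

def scrape_trail_page(url):
    # Step 3: would navigate to the trail page and read its stats tables.
    return {"url": url, "Distance": "NA", "Difficulty Rating": "NA"}

def add_to_database(rows, state):
    # Step 4: would hand the rows to database_info.add_trails.
    print(f"adding {len(rows)} rows for {state}")

def run(state):
    # Step 1: the user-supplied state drives everything else.
    rows = [scrape_trail_page(url) for url in get_trail_links(state)]
    add_to_database(rows, state)
    return rows

rows = run("california")
```

Each real helper replaces one stub, which is why the two scrapers can share the same overall structure.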

scraper.py

As an example, let’s take a look at the function state_scraper in scraper.py. This is the outermost scraping function and takes in a state that the user inputs. However, for our project, we just used it to grab trails from California, since it would’ve taken a very long time to do this for the entire country, given that TrailForks has nearly 300,000 trails registered in the US alone.

def state_scraper(state_name):
    """
    This is the primary scraping function used inside of `main`.
    
    Inputs:
    ``state_name``: The name of the state in question.
    
    For each listing page of the state, the function collects the trail urls into
    ``href_list`` and, using selenium's Chrome driver, gets each `url` and calls the
    3 individual scraping functions: `scrape_basic_stats`, `scrape_tables` (twice--once
    for trail information and once for trail statistics), and `get_names_coords`.
    Once all of the information is gathered into dictionaries, they are combined and
    added into `trails.db`.
    
    """
    start_url = f"https://www.trailforks.com/region/{state_name}/trails//?activitytype=6&region=3106"
    url_list = [start_url]
    
    for page_num in range(state_page_dict[state_name]):
        url_list.append(start_url + f"&page={page_num+2}")
        
    for page_num in range(state_page_dict[state_name]):
        href_list = []
        driver.get(url_list[page_num])
        green_links = driver.find_elements("xpath","//tr//a[contains(@class, 'green')]")
        
        for green in green_links:
            href = green.get_attribute("href") #Grab URLs--otherwise this doesn't work
            href_list.append(href)
        

        for url in href_list:
            trail_keywords = ["Activities", "Riding Area", "Difficulty Rating", "Local Popularity"]
            stats_keywords = ["Altitude start", "Altitude end", "Grade"]
            basic_vars_dict = {"Distance": ["NA"], "Avg time": ["NA"], "Climb": ["NA"], "Descent": ["NA"]}
            trail_details_vars = {"Activities": ["NA"], "Riding Area": ["NA"], "Difficulty Rating": ["NA"], "Local Popularity": ["NA"]}
            trail_stats_vars = { "Altitude start": ["NA"], "Altitude end": ["NA"], "Grade": ["NA"]}
            name_coord_dict = {'Name': ["NA"],'Coords': ["NA"]}
            print(url)
            driver.get(url)
            basic_vars = scrape_basic_stats(basic_vars_dict)
            trail_stats = scrape_tables(stats_keywords, 'trailstats_display', trail_stats_vars)
            trail_details = scrape_tables(trail_keywords, 'traildetails_display', trail_details_vars)
            names_coords = get_names_coords(name_coord_dict)
    
            database_info.add_trails(pd.DataFrame({**names_coords,**basic_vars, **trail_details, **trail_stats}),state_name)
            
    driver.quit()

There’s a lot going on here! I’ll attach screenshots from TrailForks here to make it easier to follow along. The first thing that we do is establish a starting url which we get from TrailForks thanks to an f-string that contains the name of the state that we want. We then append that url to a url list. This list is important because we couldn’t actually get this to work without having it, since going back and forward is a bit harder in Selenium than it is in scrapy.

Then, we store the urls for all of the trail list pages that correspond to a given state, which on the website, looks like this:

image.png

The numbers come from a dictionary saved in database_info.py. Before making scraper.py, I scraped the page numbers.
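For illustration, the pre-scraped page count feeds a loop like the one below; this is a sketch with made-up values (the page count and the exact trails URL path are stand-ins for what's in database_info.py):

```python
# Hypothetical page count; the real values live in database_info.py.
state_page_dict = {"california": 3}

state_name = "california"
# Assumed URL shape for a state's trail list (the parks version appears later).
start_url = f"https://www.trailforks.com/region/{state_name}/trails/"
url_list = [start_url]

# One extra URL per additional results page, mirroring the scraper's loop.
for page_num in range(state_page_dict[state_name]):
    url_list.append(start_url + f"?page={page_num + 2}")

print(url_list[-1])  # https://www.trailforks.com/region/california/trails/?page=4
```

Because the full list of page URLs is built up front, the scraper never has to navigate "back" to the list page after visiting a trail, which is the Selenium awkwardness mentioned above.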

Then, on each page, we grab the urls to individual trails. Note that Selenium uses a driver object to interface with the page that it's scraping, and the key methods here are get (which just navigates to the page) and find_elements, which has two arguments. The first argument specifies how the scraper should interpret the second argument, which is some sort of instruction. For example, when we grab all of the links on the page to the trails, we use

 driver.find_elements("xpath","//tr//a[contains(@class, 'green')]")

where the second input is an xpath instruction that find_elements can read thanks to the fact that the first argument is simply a string that says xpath. There’s a lot of documentation for xpath available online, which I started to investigate for the movie database homework assignment. This website came in handy: https://www.w3schools.com/xml/xpath_syntax.asp

In the above example, we grab all links (a) contained in all table rows (tr), given that they are of class green (not sure why they're called that, but links to trails are of that class). Once we actually load it into Python, though, we have to call

for green in green_links:
    href = green.get_attribute("href")  # Grab URLs--otherwise this doesn't work
    href_list.append(href)

in a for loop to specifically take out the link portion of each a and then add it to the URL list.

## Individual Scraping Functions

Now we can get to the fun part. We essentially give a set of instructions to go to each trail’s page armed with a set of both lists containing keywords as well as dictionaries. The idea here is to search for specific variables stored in each trail and then update the value inside of the corresponding dictionary. As you can see, we split this task into four parts:


basic_vars = scrape_basic_stats(basic_vars_dict)
trail_stats = scrape_tables(stats_keywords, 'trailstats_display', trail_stats_vars)
trail_details = scrape_tables(trail_keywords, 'traildetails_display', trail_details_vars)
names_coords = get_names_coords(name_coord_dict)

Let’s go over each one briefly.

### scrape_basic_stats

scrape_basic_stats takes the basic_vars_dict as an input:


def scrape_basic_stats(basic_vars_dict):
    """
    This function should be called within the `state_scraper` function.
    
    Inputs:
    ``basic_vars_dict``: a dictionary with the relevant variable names as keys and, by default,
    `NA` as values.
    
    On a given trail's website, it will find a `div` element of id `basicTrailStats`.
    This is a table containing information about the trail's length, average time to
    completion, ascent, and descent, if those variables are available.
    The function looks for `div`s of ID `padded10` which contain `col-3` class `div`s.
    The name of each variable is the `text` of the `small.grey` class within each `col-3`.
    The variable's value is stored in the `text` of `div` `large` or `large hovertip`
    
    The function uses conditional statements as not all of the variables are available.
    """
    try:
        basic_stats_div = driver.find_element(By.ID, "basicTrailStats")
        
        padded10_divs = basic_stats_div.find_elements(By.CLASS_NAME, "padded10")
        
        for padded10_div in padded10_divs:
            col_3_divs = padded10_div.find_elements(By.CLASS_NAME, "col-3")
            for col_3_div in col_3_divs:
                small_grey_div = col_3_div.find_element(By.CLASS_NAME, "small.grey")
                variable_name = small_grey_div.text
                large_div = col_3_div.find_element(By.CSS_SELECTOR, ".large, .large.hovertip")
                variable_value = large_div.text
                if variable_name in basic_vars_dict:
                    basic_vars_dict[variable_name] = variable_value
    except:
        pass  # basicTrailStats occasionally doesn't exist at all
    return basic_vars_dict

This function essentially looks for the light gray box at the top of each trail's page which has some basic information like distance, climb, descent, and average completion time. This box has an ID of basicTrailStats (see how we use find_element with By.ID?) and contains a div called padded10, which in turn contains divs of class col-3. These work like dictionaries: each one has a div of class large and one of class small grey (note that we use periods here) whose text contains the information we want.

We have to make sure that each variable that we want is in our dictionary and, furthermore, that it's present on the trail page. Some trails have sparse or non-existent information for certain variables, and as a result, the specified elements simply don't exist. This is also why this (and every other) helper function has its body in a try/except block, since it's possible for basicTrailStats to not exist at all (though only on rare occasions).

### scrape_tables


def scrape_tables(var_list, element_id, var_dict):
    """
    This function is used within `state_scraper` to access `ul`s with variable information.
    
    Inputs:
    ``var_list``: a list of the relevant variable names.
    ``element_id``: the `id` of the `ul` in question--there are two which need to be scraped.
    ``var_dict``: a dictionary with the relevant variable names as keys and, by default,
    `NA` as values.
    
    Each `ul` contains `li`s formatted like a dictionary with a `div` of class `term`,
    which stores variable names, and a `div` of class `definition` which stores variable
    values. Both `div`s store the key information in their `text`. Since not all variables
    are needed, we check if they are in ``var_list`` before assigning them to
    ``var_dict``'s values.
    
    """
    try:
        li_elements = driver.find_elements("xpath", f"//ul[contains(@id, '{element_id}')]//li")
        for li_element in li_elements:
            term_div = li_element.find_elements(By.CLASS_NAME, "term")
            definition_div = li_element.find_elements(By.CLASS_NAME, "definition")
            for idx, terms in enumerate(term_div):
                if terms.text in var_list:
                    var_dict[terms.text] = [definition_div[idx].text]
    
    except:
        pass
    return var_dict

This function is called twice in the body of the scraper because there are two similarly structured tables (one with general trail information and one with more detailed statistics) that we want to grab certain variables from. The tables are unordered lists (uls) containing list elements (lis), which we can conveniently loop through because find_elements returns a Python iterable (similarly to scrapy). The lists are once again sorted like dictionaries: in fact, they contain divs of class term and definition, which we parse through one by one and use to update our dictionary, which is then returned.
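The term/definition pairing logic can be illustrated without Selenium. Here's a toy sketch where plain strings stand in for the text of the scraped divs (the trail values are made up):

```python
# Toy stand-ins for the text of the "term" and "definition" divs on one trail page.
terms = ["Activities", "Riding Area", "Condition", "Difficulty Rating"]
definitions = ["Hike", "Yosemite", "Unknown", "Blue"]

var_list = ["Activities", "Riding Area", "Difficulty Rating"]  # variables we keep
var_dict = {name: ["NA"] for name in var_list}

# The same enumerate-and-index pattern scrape_tables uses: the idx of each term
# lines up with the idx of its definition.
for idx, term in enumerate(terms):
    if term in var_list:
        var_dict[term] = [definitions[idx]]

print(var_dict)
# {'Activities': ['Hike'], 'Riding Area': ['Yosemite'], 'Difficulty Rating': ['Blue']}
```

Note that "Condition" is silently dropped because it isn't in var_list, which is exactly how scrape_tables ignores variables we don't care about.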

### get_names_coords()

def get_names_coords(name_coord_dict):
  """
  This function is used within `state_scraper` to access each trail's name and coordinates.
  
  Inputs:
  ``name_coord_dict``: a dictionary with keys `Name` and `Coords`. By default, the values are `NA`
  
  The function grabs the trail's name from a `span` of class `translate` from the top of the page.
  It also grabs the coordinates from a `span` of class `grey2` within a `div` of class
  `margin-bottom-15`. The coordinates are stored in the `span`'s `text`.
  """
  try:
      name_raw = driver.find_element("xpath", "//span[contains(@class, 'translate')][1]")
      name_coord_dict['Name'] = [name_raw.text]
  except:
      name_coord_dict['Name'] = ["NA"]  # wrap in a list for pd.DataFrame consistency
  try:
      coord_raw = driver.find_element("xpath", "//div[contains(@class, 'margin-bottom-15 grey')]/span[contains(@class, 'grey2')][2]") #Get coords
      name_coord_dict['Coords'] = [coord_raw.text]
  except:
      name_coord_dict['Coords'] = ["NA"]
      
  return name_coord_dict

This is a bit more of a miscellaneous function, since it’s not dedicated to any one purpose. All it does is store the name of the trail as well as its coordinates. These are stored as two different classes of divs found on different parts of the page. Then, we just grab their text. Simple as that!

Once we grab all of that information, we add it to our SQL database:

database_info.add_trails(pd.DataFrame({**names_coords,**basic_vars, **trail_details, **trail_stats}),state_name)

which I’ll talk about in the next section.

image.png

## scraper_parks.py

This file also contains a function called state_scraper which once again takes a state’s name as its input, but this time, it’s focused on collecting information about parks rather than trails (on TrailForks, there are many trails within one park). I was able to run this to get some numerical data about parks throughout the US (including national parks) which was then integrated into the website and recommender system.

This time, I don’t use helper functions (rather, I just keep it in the state_scraper’s body), so we’ll go through it bit by bit. Firstly, it’s worth going over what we actually want from each park’s page:

  1. Each park’s name and location
  2. The number of trails in each park, how long the trails are, and the popularity ranking.
  3. How many trails there are of each difficulty

## Set-up

The settings for Selenium are the same as in the last case. The pre-scraping part is almost identical as well:

start_url = f"https://www.trailforks.com/region/{state_name}/ridingareas/?activitytype=6"
url_list = [start_url]

for page_num in range(database_info.state_dictionary[state_name]):
    url_list.append(start_url + f"&page={page_num+2}")

for page_num in range(state_page_dict[state_name]):
    href_list = []
    driver.get(url_list[page_num])
    green_links = driver.find_elements("xpath","//tr//a[contains(@class, 'green')]")

    for green in green_links:
        href = green.get_attribute("href")  # Grab URLs--otherwise this doesn't work
        href_list.append(href + "/?activitytype=6")

Once again, I pre-scraped the number of pages required per state, though because there are fewer parks than trails, it wasn’t as big of a load on my computer.

### Names and coordinates

Here’s the first part of where we actually scrape:


for url in href_list:
    no_name_found = False
    info_dict = {"Name": ["NA"], "Location": ["NA"], "Coords": ["NA"]}
    stats_dict = {"Trails (view details)": ["NA"], "Total Distance": ["NA"], "State Ranking": ["NA"]}
    trail_difficulty_count = {"Access Road/Trail": 0, "White": 0, "Green": 0, "Blue": 0,
                              "Black": 0, "Double Black Diamond": 0, "Proline": 0}
    print(url)
    driver.get(url)
    area_name_raw = driver.find_element("xpath", "//span[contains(@class, 'translate')][1]")
    info_dict["Name"] = area_name_raw.text
    try:
        city_name_raw = driver.find_element(By.CLASS_NAME, "small.grey2.mobile_hide")
        info_dict["Location"] = city_name_raw.text
    except:
        no_name_found = True
                

We create the three dictionaries that we want for each URL. The print(url) call is present as a debugging tool since, unfortunately, this scraper crashed multiple times due to bugs (mostly caused by elements not being present on the page) which I eventually patched out.

We get the name of the park by finding a span with class translate (not sure why it's stored like that; it's actually within an h1 within a ul called page_title_container). Then, we try to look for the name of the city that it's in by grabbing a small piece of text that's next to the park's name. Sometimes this isn't present, which is why we have a bool called no_name_found to record when it's not. There's a way around this, though, which we'll show later…

### Ranking, Distance, and Trail numbers

stats_items = ["State Ranking", "Total Distance", "Trails (view details)"]
dict_category = driver.find_elements("xpath", "//dl//dt")
dict_information = driver.find_elements("xpath", "//dl//dd")

for idx, terms in enumerate(dict_category):
    if terms.text in stats_items:
        stats_dict[terms.text] = [dict_information[idx].text]

try:
    difficulty_ul = driver.find_element(By.CLASS_NAME, 'stats.flex.nostyle.inline.clearfix')

    for li in difficulty_ul.find_elements(By.TAG_NAME, 'li'):
        difficulty_span = li.find_element(By.XPATH, './/span[contains(@class, "stat-label clickable")]/span')
        difficulty_name = difficulty_span.get_attribute('title')
        if difficulty_name in trail_difficulty_count.keys():
            num_trails_span = li.find_element(By.CLASS_NAME, 'stat-num')
            num_trails = int(num_trails_span.text)
            trail_difficulty_count[difficulty_name] = num_trails
except:
    pass  # Skip parks without a difficulty breakdown

The code here is somewhat dense because all of this information is stored in a dictionary-like object called a dl, which stores something like a key in each dt and something like a value in each dd. Essentially, we update the ranking and trail distances by inspecting these.

It's a little bit harder to get the number of trails per difficulty. Basically, there's an unordered list with a long class name ('stats.flex.nostyle.inline.clearfix') that sorts the number of trails by difficulty. Each li has the number of trails stored within it, but it also has a graphic (a small picture) that represents the difficulty, and it's the graphic that actually hides the name of the difficulty, which is why we have to extract difficulty_name from a span of class stat-label clickable. Then, we simply grab the actual text that displays how many trails of a given difficulty there are, convert it to an integer, and add it to our dictionary.

### Coordinates

One of the unfortunate parts of the parks list is that the coordinates of each park are not present! To get around this, we tell the scraper to go to the first trail in each park and grab its coordinates (remember scraper.py?) and then store it.

try:
    green_link = driver.find_element("xpath","//tr//a[contains(@class, 'green')]")
    park_link = green_link.get_attribute("href")
    driver.get(park_link)
except:
    pass

try:
    coord_raw = driver.find_element("xpath", "//div[contains(@class, 'margin-bottom-15 grey')]/span[contains(@class, 'grey2')][2]") #Get coords
    info_dict['Coords'] = [coord_raw.text]
    if no_name_found:
        city_name_raw = driver.find_element(By.CLASS_NAME, "weather_date bold green")
        info_dict["Location"] = city_name_raw.text
except:
    info_dict['Coords'] = ["NA"]

It's here where we also resolve the issue of not finding a city's name. Basically, each trail's page has a short infobox containing weather information for the nearest city, and this box is guaranteed to appear, so we can get an approximate location name by grabbing the city name from it.

Once we’re done with that, it’s off to the database again!

## SQL Database - Zion

There's a lot of information that we scrape from TrailForks which has to be managed within a SQL database for easy access. For example, California has more than 16,000 trails, and for each trail we collected 12 variables (see above), which means there are more than 192,000 entries! We used sqlite3 to manage a database, or rather, two: one called trails.db, which contains individual trails (specifically, those in California, though our original plan was to include the entire country), and one called trails_new.db (now that I think about it, I probably should've given it a different name), which contains park information, where each state has its own table.

## database_info.py

Everything relevant to managing the databases is stored in a separate Python file called database_info.py. Here I can show you the structure of both databases:

### Making the databases

def make_db(state):
    conn = sqlite3.connect("trails.db")
    cmd = f"""
    CREATE TABLE IF NOT EXISTS {state_name_code_name_dict[state]}(
    name VARCHAR(255),
    coords VARCHAR(255),
    Distance VARCHAR(255),
    'Avg time' VARCHAR(255),
    Climb VARCHAR(255),
    Descent VARCHAR(255),
    Activities VARCHAR(255),
    'Riding Area' VARCHAR(255),
    'Difficulty Rating' VARCHAR(255),
    'Dogs Allowed' VARCHAR(255),
    'Local Popularity' VARCHAR(255),
    'Altitude start' VARCHAR(255),
    'Altitude end' VARCHAR(255),
    Grade VARCHAR(255)
    );
    """
    cursor = conn.cursor()
    cursor.execute(cmd)
    cursor.close()
    conn.close()
    
def make_db_parks(state):
    conn = sqlite3.connect("trails_new.db")
    cmd = f"""
    CREATE TABLE IF NOT EXISTS {state_name_code_name_dict[state]}(
    Name VARCHAR(255),
    Location VARCHAR(255),
    Coords VARCHAR(255),
    'Trails (view details)' SMALLINT(255),
    'Total Distance' VARCHAR(255),
    'State Ranking' VARCHAR(255),
    'Access Road/Trail' SMALLINT(255),
    White SMALLINT(255),
    Green SMALLINT(255),
    Blue SMALLINT(255),
    Black SMALLINT(255),
    'Double Black Diamond' SMALLINT(255),
    Proline SMALLINT(255)
    );
    """
    cursor = conn.cursor()
    cursor.execute(cmd)
    cursor.close()
    conn.close()

These two functions were run in order to actually create the databases for the first time. They contain the variables mentioned previously, mostly in the form of text.

### Adding information

If you recall from the scraping functions, there was a function call that would add information from each park to the SQL database. Here’s the source code for those functions:

def get_db():
    conn = sqlite3.connect("trails.db")
    return conn
    
def add_trails(df,state):
    conn = get_db()
    df.to_sql(state, conn, if_exists = "append", index = False)
    
def get_db_new():
    conn = sqlite3.connect("trails_new.db")
    return conn
    
def add_trails_new(df,state):
    conn = get_db_new()
    df.to_sql(state, conn, if_exists = "append", index = False)

The functions get_db and get_db_new (most things relating to scraper_parks are labeled new since we did this second) establish connections to their respective databases. add_trails and add_trails_new, therefore, are actually responsible for adding entries to each database. Note that they take a df as one input (which contains the scraped info) and a state name, which sends the information to the correct table.
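Here's a minimal sketch of what add_trails does under the hood, using an in-memory SQLite database in place of trails.db and a made-up one-row DataFrame (the column values are stand-ins):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # stand-in for trails.db

# One scraped trail, shaped like the merged dictionary built in state_scraper.
df = pd.DataFrame({"Name": ["Sample Trail"], "Coords": ["(34.0, -118.0)"],
                   "Distance": ["5 miles"], "Climb": ["800 ft"]})

# The same call add_trails makes: append to the state's table
# (to_sql creates the table if it doesn't exist yet).
df.to_sql("California", conn, if_exists="append", index=False)

rows = conn.execute("SELECT Name, Distance FROM California").fetchall()
print(rows)  # [('Sample Trail', '5 miles')]
conn.close()
```

Because if_exists="append" never drops the table, calling this once per scraped trail accumulates rows, which is exactly why the scraper can run state by state without clobbering earlier work.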

### Miscellaneous Tables

There are several dictionaries and lists that we generated in order to make the functions easier to run:

states = ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "idaho-3166", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "new-hampshire", "new-jersey", "new-mexico", "new-york", "north-carolina", "north-dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "rhode-island", "south-carolina", "south-dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "west-virginia", "Wisconsin", "Wyoming"]

state_name_code_name_dict = {
    'Alabama': 'Alabama',
    'Alaska': 'Alaska',
    'Arizona': 'Arizona',
    'Arkansas': 'Arkansas',
    'California': 'California',
    'Colorado': 'Colorado',
    'Connecticut': 'Connecticut',
    'Delaware': 'Delaware',
    'Florida': 'Florida',
    'Georgia': 'Georgia',
    'Hawaii': 'Hawaii',
    'idaho-3166': 'Idaho',
    'Illinois': 'Illinois',
    'Indiana': 'Indiana',
    'Iowa': 'Iowa',
    'Kansas': 'Kansas',
    'Kentucky': 'Kentucky',
    'Louisiana': 'Louisiana',
    'Maine': 'Maine',
    'Maryland': 'Maryland',
    'Massachusetts': 'Massachusetts',
    'Michigan': 'Michigan',
    'Minnesota': 'Minnesota',
    'Mississippi': 'Mississippi',
    'Missouri': 'Missouri',
    'Montana': 'Montana',
    'Nebraska': 'Nebraska',
    'Nevada': 'Nevada',
    'new-hampshire': 'NewHampshire',
    'new-jersey': 'NewJersey',
    'new-mexico': 'NewMexico',
    'new-york': 'NewYork',
    'north-carolina': 'NorthCarolina',
    'north-dakota': 'NorthDakota',
    'Ohio': 'Ohio',
    'Oklahoma': 'Oklahoma',
    'Oregon': 'Oregon',
    'Pennsylvania': 'Pennsylvania',
    'rhode-island': 'RhodeIsland',
    'south-carolina': 'SouthCarolina',
    'south-dakota': 'SouthDakota',
    'Tennessee': 'Tennessee',
    'Texas': 'Texas',
    'Utah': 'Utah',
    'Vermont': 'Vermont',
    'Virginia': 'Virginia',
    'Washington': 'Washington',
    'west-virginia': 'WestVirginia',
    'Wisconsin': 'Wisconsin',
    'Wyoming': 'Wyoming'
}


state_dictionary = {'Alabama': 11, 'Alaska': 11, 'Arizona': 49, 'Arkansas': 16, 'California': 152, 'Colorado': 69, 'Connecticut': 56, 'Delaware': 4, 'Florida': 18, 'Georgia': 17, 'Hawaii': 5, 'idaho-3166': 31, 'Illinois': 51, 'Indiana': 10, 'Iowa': 8, 'Kansas': 3, 'Kentucky': 9, 'Louisiana': 2, 'Maine': 27, 'Maryland': 16, 'Massachusetts': 146, 'Michigan': 55, 'Minnesota': 36, 'Mississippi': 3, 'Missouri': 11, 'Montana': 41, 'Nebraska': 3, 'Nevada': 16, 'new-hampshire': 41, 'new-jersey': 40, 'new-mexico': 25, 'new-york': 60, 'north-carolina': 26, 'north-dakota': 7, 'Ohio': 29, 'Oklahoma': 4, 'Oregon': 38, 'Pennsylvania': 54, 'rhode-island': 9, 'south-carolina': 6, 'south-dakota': 7, 'Tennessee': 16, 'Texas': 50, 'Utah': 62, 'Vermont': 25, 'Virginia': 27, 'Washington': 92, 'west-virginia': 18, 'Wisconsin': 25, 'Wyoming': 19}

state_parks_dictionary = {'Alabama': 1, 'Alaska': 1, 'Arizona': 3, 'Arkansas': 2, 'California': 8, 'Colorado': 4, 'Connecticut': 7, 'Delaware': 1, 'Florida': 2, 'Georgia': 2, 'Hawaii': 1, 'idaho-3166': 2, 'Illinois': 10, 'Indiana': 1, 'Iowa': 1, 'Kansas': 1, 'Kentucky': 1, 'Louisiana': 1, 'Maine': 3, 'Maryland': 1, 'Massachusetts': 7, 'Michigan': 5, 'Minnesota': 3, 'Mississippi': 1, 'Missouri': 2, 'Montana': 2, 'Nebraska': 1, 'Nevada': 1, 'new-hampshire': 3, 'new-jersey': 3, 'new-mexico': 2, 'new-york': 5, 'north-carolina': 3, 'north-dakota': 1, 'Ohio': 4, 'Oklahoma': 1, 'Oregon': 3, 'Pennsylvania': 3, 'rhode-island': 1, 'south-carolina': 1, 'south-dakota': 1, 'Tennessee': 2, 'Texas': 4, 'Utah': 3, 'Vermont': 2, 'Virginia': 2, 'Washington': 6, 'west-virginia': 2, 'Wisconsin': 3, 'Wyoming': 1}

state_dictionary and state_parks_dictionary store the number of pages required for each state. states simply contains the names of all the states in alphabetical order, and state_name_code_name_dict maps the way a state appears in TrailForks URLs to the name of its SQL table.
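To make the relationship between these objects concrete, here's how a single irregular entry flows through them (a small excerpt of the dictionaries above):

```python
# Excerpt of the lookup tables defined in database_info.py.
state_name_code_name_dict = {"idaho-3166": "Idaho", "south-dakota": "SouthDakota"}
state_dictionary = {"idaho-3166": 31, "south-dakota": 7}

slug = "idaho-3166"                       # how the state appears in TrailForks URLs
table = state_name_code_name_dict[slug]   # the SQL table it maps to
pages = state_dictionary[slug]            # how many trail-list pages to scrape

print(table, pages)  # Idaho 31
```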

## Connecting National Parks to Individual Trail/Park Info

Now we need to make sure to connect the data that we’ve collected here with the actual table generated by the recommender to give the user more information. Let’s take a look at our output from the similarity score model:

output
national_park state trail activity overall_rating comment_title comment_ratings comment_text Latitude Longitude Area Visitors (2018)
303 Badlands National Park South Dakota (SD) Pinnacles Overlook Points of Interest & Landmarks 5.0 Must See Pullover 5.0 of 5 bubbles This is one of a handful of overlooks you have... 43.75 -102.50 242,755.94 acres (982.4 km2) 1008942
235 Arches National Park Utah (UT) Delicate Arch Points of Interest & Landmarks 5.0 Delicate Arch 5.0 of 5 bubbles Our family chose to hike to Delicate Arch late... 38.68 -109.57 76,678.98 acres (310.3 km2) 1663557
863 Capitol Reef National Park Utah (UT) Capitol Reef National Park National Parks 4.5 Add Capitol Reef to Your Utah National Park List 5.0 of 5 bubbles Just to the northeast of more popular parks Br... 38.20 -111.17 241,904.50 acres (979.0 km2) 1227627
1310 Death Valley National Park California (CA) Zabriskie Point Geologic Formations 4.5 The Most Iconic Place in Death Valley 4.0 of 5 bubbles You can't miss it. I don't mean you have to do... 36.24 -116.82 3,373,063.14 acres (13,650.3 km2) 1678660
1611 Grand Teton National Park Wyoming (WY) Taggart Lake Hiking Trails 5.0 Do this hike if you want to feel like you're a... 5.0 of 5 bubbles It's not a difficult hike and is right off the... 43.73 -110.80 310,044.22 acres (1,254.7 km2) 3491151
222 Arches National Park Utah (UT) Double Arch Hiking Trails 5.0 Easy hike 5.0 of 5 bubbles The Double Arch is unreal. It is massive and b... 38.68 -109.57 76,678.98 acres (310.3 km2) 1663557
3198 Mount Rainier National Park Washington (WA) Sunrise Visitor Center Visitor Centers 4.5 Amazing views 5.0 of 5 bubbles Amazing hikes of all varieties. Many travel up... 46.85 -121.75 236,381.64 acres (956.6 km2) 1518491
1439 Glacier National Park Montana (MT) Grinnell Glacier Hiking Trails 5.0 Incredible vies and the end-point is rewarding 5.0 of 5 bubbles This 13 mile hike from Many Glacier to upper G... 48.80 -114.00 1,013,125.99 acres (4,100.0 km2) 2965309
1366 Glacier National Park Montana (MT) Virginia Falls Waterfalls 5.0 Magnificent Falls in Glacier National Park - w... 5.0 of 5 bubbles This is the second falls on a hike in Glacier ... 48.80 -114.00 1,013,125.99 acres (4,100.0 km2) 2965309
650 Canyonlands National Park Utah (UT) Horseshoe Canyon Canyons 5.0 WHOA! READ PLEASE. Things you NEED to know a... 5.0 of 5 bubbles There are some older reviews. Some are VERY M... 38.20 -109.93 337,597.83 acres (1,366.2 km2) 739449

Because we have two different SQL databases, one for nation-wide park data (trails_new.db) and one with state-wide trail data (trails.db), let’s split this into two different frames.

california_df = output[output['state'] == 'California (CA)']
non_california_df = output[output['state'] != 'California (CA)']

Now we’ll get our databases in our notebook:

!wget https://raw.githubusercontent.com/torwar02/trails/main/trails/trails.db -O trails.db
!wget https://raw.githubusercontent.com/torwar02/trails/main/trails/trails_new.db -O trails_new.db
--2024-03-22 20:55:28--  https://raw.githubusercontent.com/torwar02/trails/main/trails/trails.db
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3534848 (3.4M) [application/octet-stream]
Saving to: ‘trails.db’

trails.db           100%[===================>]   3.37M  --.-KB/s    in 0.07s   

2024-03-22 20:55:28 (48.8 MB/s) - ‘trails.db’ saved [3534848/3534848]

--2024-03-22 20:55:28--  https://raw.githubusercontent.com/torwar02/trails/main/trails/trails_new.db
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1339392 (1.3M) [application/octet-stream]
Saving to: ‘trails_new.db’

trails_new.db       100%[===================>]   1.28M  --.-KB/s    in 0.06s   

2024-03-22 20:55:28 (23.1 MB/s) - ‘trails_new.db’ saved [1339392/1339392]

There’s a bit of an issue, though. Let’s look at our table names:

import sqlite3
db_path = 'trails_new.db'

conn = sqlite3.connect(db_path) #Establish connection with DB
cur = conn.cursor()

cur.execute("SELECT name FROM sqlite_master WHERE type='table';") #This specifically grabs all table names from our database.
tables = cur.fetchall()
table_names = [table[0] for table in tables] #Places them into a list
print("List of tables in the database:", table_names)
conn.close()
List of tables in the database: ['Maine', 'California', 'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'idaho-3166', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'new-hampshire', 'new-jersey', 'new-mexico', 'new-york', 'north-carolina', 'north-dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'rhode-island', 'south-carolina', 'south-dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'west-virginia', 'Wisconsin', 'Wyoming']

Our tables aren't completely in alphabetical order (I was testing around with Maine first, for instance). And some of them aren't formatted as two capitalized words, like south-dakota, for instance. But if we compare this to what we have in output:

set(output['state'])
{'California (CA)',
 'Montana (MT)',
 'South Dakota (SD)',
 'Utah (UT)',
 'Washington (WA)',
 'Wyoming (WY)'}

Here we have nice, capitalized state names with two-letter abbreviations. So, then, how are we going to fix this? We’re going to create a dictionary that essentially works as a mapping that takes what we have in output and matches it to what exists in table_names based on some matching criteria:

# Extract unique states and sort them
unique_states_in_output = sorted(set(output['state']), key=str.lower)
table_names = sorted(table_names, key=str.lower)



def compare_letters(state_name, table_name):
    clean_state_name = ''.join(filter(str.isalpha, state_name)).lower() #Eliminate non-alphabetical characters, condense together
    clean_table_name = ''.join(filter(str.isalpha, table_name)).lower()
    return sorted(clean_state_name) == sorted(clean_table_name) #Gives a boolean value.

state_name_to_table_name = {} #Create new dictionary
for state_with_abbreviation in unique_states_in_output:
    state_name = state_with_abbreviation.split(' (')[0]  # Get rid of the parentheses in the abbreviation (like 'South Dakota (SD)')
    match = next((table for table in table_names if compare_letters(state_name, table)), None) #Generator based on whether or not names are the same
    if match:
        state_name_to_table_name[state_with_abbreviation] = match #Update dict if match found

print(state_name_to_table_name)
{'California (CA)': 'California', 'Montana (MT)': 'Montana', 'South Dakota (SD)': 'south-dakota', 'Utah (UT)': 'Utah', 'Washington (WA)': 'Washington', 'Wyoming (WY)': 'Wyoming'}

Now that’s what we’re looking for! We do a few important things here:

Firstly, we make sure to get both the states that we have in output and the tables in table_names in alphabetical order. The reason we pass key=str.lower is that some of the table names are capitalized while others are lowercase; this makes the sort case-insensitive.

Then we create a helper function called compare_letters, which takes two state names (one from output, one from the database) and checks whether they contain the same letters. We do this by filtering out non-alphabetical characters (including spaces and hyphens), making everything lowercase, and comparing the sorted letters. The function returns True or False depending on whether they match.

We actually build state_name_to_table_name in the for loop below this. We go through each of the states in output, extract just the part of the state name that comes before the two-letter abbreviation, and then create a generator that individually calls compare_letters on each of the table names. If it returns True, then we have a match, which causes the dictionary to be updated. Otherwise, nothing happens and we simply move on to the next entry (that's why the second argument of next is None).
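To see why comparing sorted letters works as the matching criterion, consider 'South Dakota (SD)' against the table name south-dakota (compare_letters reproduced from above):

```python
def compare_letters(state_name, table_name):
    # Keep only letters, lowercase them, and compare the sorted results.
    clean_state = ''.join(filter(str.isalpha, state_name)).lower()
    clean_table = ''.join(filter(str.isalpha, table_name)).lower()
    return sorted(clean_state) == sorted(clean_table)

state = "South Dakota (SD)".split(' (')[0]     # -> "South Dakota"
print(compare_letters(state, "south-dakota"))  # True
print(compare_letters(state, "north-dakota"))  # False
```

Both "South Dakota" and "south-dakota" reduce to the same multiset of letters once spaces, hyphens, and case are stripped, so the anagram check matches them while rejecting other states.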

## Logic for linking databases

Our goal is now to go through each recommendation and match up either the park or trail information corresponding to it (assuming that it's present in the database). One issue that can arise, however, is that the name of a park in output might be different from its name in the database. To mitigate this, we're going to instead compare the coordinates of what's in output to the rows inside of trails_new.db and trails.db. The idea is that if two parks are close enough to each other in terms of their coordinates, then they should represent the same thing. So, we're going to make two functions that do similar (but different) things: one called fetch_park_info_based_on_coords, which looks at parks (i.e., outside of California), and one called fetch_trail_info_based_on_coords, which looks at California trails.

import sqlite3

def fetch_park_info_based_on_coords(db_name, latitude, longitude, margin_lat, margin_long):
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor() #Connect to database

    try:
        for table_name in state_name_to_table_name.values(): #This is what we made earlier
            cursor.execute(f'SELECT * FROM "{table_name}"') #Grab everything from the table
            rows = cursor.fetchall()

            for row in rows: #For each row
                coords_text = row[2]  # Coords are in the third column
                try:
                    coords = eval(coords_text)  # Parses the "(lat, long)" text into a tuple
                    lat_diff = abs(coords[0] - latitude)
                    long_diff = abs(coords[1] - longitude)

                    if lat_diff < margin_lat and long_diff < margin_long:
                        return row[3:]  # Don't need name and coords
                except Exception:
                    continue  # Skip rows with invalid 'Coords'
        return None
    finally:
        conn.close()  # Close the connection even when returning early

def fetch_trail_info_based_on_coords(db_name, latitude, longitude, margin_lat, margin_long):
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()
    table_name = 'California'  #Only getting CA trails

    try:
        cursor.execute(f'SELECT * FROM {table_name}') #Grab everything
        rows = cursor.fetchall()

        for row in rows:
            coords_text = row[1]  # Coords are in column 2
            try:
                coords = eval(coords_text)
                lat_diff = abs(coords[0] - latitude)
                long_diff = abs(coords[1] - longitude)

                if lat_diff < margin_lat and long_diff < margin_long:
                    return row[2:]
            except Exception:
                continue  # Skip rows with invalid 'Coords'
        return None
    finally:
        conn.close()  # Close the connection even when returning early

Okay, so, it will make a lot more sense if we actually inspect the structure of our database again. Click the link below to see screenshots of two .csv files: the first is of parks in Wyoming, and the second is of trails in California:

https://imgur.com/a/6fCixEt

With that out of the way, let's dive into the code. We go through the mapping dictionary that we made previously and grab all of the possible parks from each table. Then, we look at the third column (i.e., row[2], the third entry in the row), which corresponds to the coordinates (see screenshot), and we record the absolute difference between those coordinates and a given latitude and longitude (we'll be taking those from output, where they're individual columns rather than a tuple). If both differences are within a specified margin of error, then we've found our match. Note that we only return the columns after the coordinates: the leading ones just identify the park by name and coordinates.

For fetch_trail_info_based_on_coords, we have a very similar set-up except for the fact that the coordinates are in the second column, and we’re interested in returning everything after the first two.
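Both functions parse the 'Coords' text with eval; if you want a safer drop-in, ast.literal_eval from the standard library only accepts literal values. A quick sketch with a sample coordinate string written in the style of the database:

```python
import ast

# A made-up value in the style of the 'Coords' column.
coords_text = "(43.75, -102.5)"
coords = ast.literal_eval(coords_text)  # safely parsed into a tuple

# The same margin check the functions above perform:
print(abs(coords[0] - 43.73) < 0.1 and abs(coords[1] - (-102.50)) < 0.1)  # True
```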

Now, let’s move on so we can see how we actually use these functions!

Putting it all together

The first thing we’re going to do is to specify the names of the new columns that we want to put into california_df and non_california_df. I’ve just grabbed these from the database:

new_columns = [
    'Trails (view details)', 'Total Distance', 'State Ranking',
    'Access Road/Trail', 'White', 'Green', 'Blue', 'Black',
    'Double Black Diamond', 'Proline'
]
new_trail_columns = [
    'Distance', 'Avg time', 'Climb', 'Descent', 'Activities',
    'Riding Area', 'Difficulty Rating', 'Dogs Allowed',
    'Local Popularity', 'Altitude start', 'Altitude end', 'Grade'
]

Now, all we need to do is iterate through the rows of non_california_df to match up the entries!

margin_lat = 0.1  # Decently generous
margin_long = 0.1
for index, row in non_california_df.iterrows():
    if pd.isna(row['Latitude']) or pd.isna(row['Longitude']): #Some parks have NA coordinates
        continue
    park_info = fetch_park_info_based_on_coords('trails_new.db', row['Latitude'], row['Longitude'], margin_lat, margin_long)
    #Remember, this grabs almost all of the columns if a match is found
    if park_info:
        non_california_df.loc[index, new_columns] = park_info #We can mass-add new columns

In the above code, we use the fetch_park_info_based_on_coords function to fetch the matching row's information once we match the coordinates. Then, we insert all of these values as new columns, taking advantage of the .loc indexer from pandas. Now let's do the same thing for the California df:


for index, row in california_df.iterrows():
    if pd.isna(row['Latitude']) or pd.isna(row['Longitude']):
        continue

    park_info = fetch_trail_info_based_on_coords('trails.db', row['Latitude'], row['Longitude'], margin_lat, margin_long)

    if park_info and len(park_info) == len(new_trail_columns):
        california_df.loc[index, new_trail_columns] = park_info
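As a toy illustration of the .loc pattern used in both loops above (the DataFrame and values below are made up):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Park A", "Park B"]})
new_cols = ["Total Distance", "State Ranking"]
for col in new_cols:
    df[col] = None                             # add empty columns up front

df.loc[0, new_cols] = ("50 miles", "#7,493")   # fill only the matched row
print(df.loc[0, "Total Distance"])             # 50 miles
```

Rows without a match keep None/NaN in the new columns, which is exactly why some rows in the results below have missing values.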

Okay, let’s take a look at our results!

non_california_df
national_park state trail activity overall_rating comment_title comment_ratings comment_text Latitude Longitude ... Trails (view details) Total Distance State Ranking Access Road/Trail White Green Blue Black Double Black Diamond Proline
303 Badlands National Park South Dakota (SD) Pinnacles Overlook Points of Interest & Landmarks 5.0 Must See Pullover 5.0 of 5 bubbles This is one of a handful of overlooks you have... 43.75 -102.50 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
235 Arches National Park Utah (UT) Delicate Arch Points of Interest & Landmarks 5.0 Delicate Arch 5.0 of 5 bubbles Our family chose to hike to Delicate Arch late... 38.68 -109.57 ... 40 50 miles #7,493 6.0 0.0 0.0 0.0 0.0 0.0 0.0
863 Capitol Reef National Park Utah (UT) Capitol Reef National Park National Parks 4.5 Add Capitol Reef to Your Utah National Park List 5.0 of 5 bubbles Just to the northeast of more popular parks Br... 38.20 -111.17 ... 60 194 miles #9,609 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1611 Grand Teton National Park Wyoming (WY) Taggart Lake Hiking Trails 5.0 Do this hike if you want to feel like you're a... 5.0 of 5 bubbles It's not a difficult hike and is right off the... 43.73 -110.80 ... 26 53 miles #4,761 1.0 1.0 0.0 1.0 0.0 0.0 0.0
222 Arches National Park Utah (UT) Double Arch Hiking Trails 5.0 Easy hike 5.0 of 5 bubbles The Double Arch is unreal. It is massive and b... 38.68 -109.57 ... 40 50 miles #7,493 6.0 0.0 0.0 0.0 0.0 0.0 0.0
3198 Mount Rainier National Park Washington (WA) Sunrise Visitor Center Visitor Centers 4.5 Amazing views 5.0 of 5 bubbles Amazing hikes of all varieties. Many travel up... 46.85 -121.75 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1439 Glacier National Park Montana (MT) Grinnell Glacier Hiking Trails 5.0 Incredible vies and the end-point is rewarding 5.0 of 5 bubbles This 13 mile hike from Many Glacier to upper G... 48.80 -114.00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1366 Glacier National Park Montana (MT) Virginia Falls Waterfalls 5.0 Magnificent Falls in Glacier National Park - w... 5.0 of 5 bubbles This is the second falls on a hike in Glacier ... 48.80 -114.00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
650 Canyonlands National Park Utah (UT) Horseshoe Canyon Canyons 5.0 WHOA! READ PLEASE. Things you NEED to know a... 5.0 of 5 bubbles There are some older reviews. Some are VERY M... 38.20 -109.93 ... 25 177 miles #9,011 9.0 0.0 0.0 0.0 0.0 0.0 0.0

9 rows × 22 columns

Success! It looks like we have a few NA values, though; it's hard to guarantee precision in the coordinates. We only had one trail for California:

california_df
national_park state trail activity overall_rating comment_title comment_ratings comment_text Latitude Longitude Area Visitors (2018)
1310 Death Valley National Park California (CA) Zabriskie Point Geologic Formations 4.5 The Most Iconic Place in Death Valley 4.0 of 5 bubbles You can't miss it. I don't mean you have to do... 36.24 -116.82 3,373,063.14 acres (13,650.3 km2) 1678660

Wait, really? I thought we would’ve had this for sure in our database…

On closer inspection, we actually do, but the coordinates on TrailForks are a bit off from what we got from the National Park data. On the TrailForks page for Zabriskie Point, the coordinates are (36.420820, -116.810120), which falls outside the margin of error.
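We can verify the mismatch with a quick arithmetic check against the 0.1-degree margins used earlier:

```python
trailforks = (36.420820, -116.810120)  # Zabriskie Point on TrailForks
our_data = (36.24, -116.82)            # coordinates in our output

lat_diff = abs(trailforks[0] - our_data[0])   # ~0.18
long_diff = abs(trailforks[1] - our_data[1])  # ~0.01

print(lat_diff < 0.1 and long_diff < 0.1)     # False: the latitude misses the margin
```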

Image Matting – Jamie

1. Introduction of Image matting and MODNet

MODNet - Portrait Image Matting

Before moving on to web development, let's take a look at a fun model for image matting: MODNet. With image matting, we can merge our selfies with pictures of any U.S. national park included in the background files, or you can upload a background of your own choice.

Image matting, also known as foreground/background separation, is a computer vision technique that aims to accurately extract the foreground object or region of interest from an image, while preserving the fine details and transparency information around the object boundaries. This process generates an alpha matte, which represents the opacity values for each pixel, allowing for seamless composition of the foreground onto a new background.
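Numerically, compositing with an alpha matte follows C = alpha * F + (1 - alpha) * B for every pixel, where F is the foreground, B the new background, and alpha the opacity. A one-pixel example:

```python
import numpy as np

F = np.array([200.0])     # foreground pixel value
B = np.array([50.0])      # new background pixel value
alpha = np.array([0.75])  # opacity from the alpha matte

# Each output pixel blends foreground and background by the matte's opacity.
C = alpha * F + (1 - alpha) * B
print(C[0])  # 162.5
```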

The MODNet (Matting Objective Decomposition Network) model is a deep learning architecture specifically designed for image matting tasks. It was introduced in the paper "MODNet: Real-Time Trimap-Free Portrait Matting via Objective Decomposition" by Zhanghan Ke and collaborators (AAAI 2022). MODNet stands out from other image matting models due to its unique approach and several key features:

  1. Decoupled Modulation: MODNet decouples the modulation process of the foreground and background features, allowing the model to better capture the intricate relationships between the foreground and background regions. This decoupling helps to improve the accuracy of the alpha matte predictions, especially around complex object boundaries.

  2. Effective Feature Fusion: MODNet incorporates an effective feature fusion mechanism that combines multi-level features from different stages of the network. This fusion strategy helps to capture both low-level details and high-level semantic information, leading to more accurate and coherent alpha matte predictions.

  3. Lightweight Architecture: Despite its impressive performance, MODNet has a relatively lightweight architecture compared to other state-of-the-art image matting models. This makes it more efficient and suitable for deployment on resource-constrained devices or in real-time applications.

  4. Improved Generalization: MODNet demonstrates strong generalization capabilities, meaning it can produce accurate alpha mattes even for objects or scenes that are significantly different from the training data. This is a crucial advantage over many traditional image matting methods that often struggle with generalization.

image.png

The key innovation of MODNet lies in its decoupled modulation approach, which allows the model to effectively disentangle the foreground and background features, leading to superior performance in capturing intricate object boundaries and transparency information. This architectural design, combined with effective feature fusion and a lightweight structure, has made MODNet a notable advancement in the field of image matting.

There are several other state-of-the-art models for image matting tasks, in addition to the MODNet architecture. Here are some notable ones:

  1. GCA Matting: Proposed in 2020, the Guided Contextual Attention (GCA) model utilizes a two-stream encoder-decoder architecture with a contextual attention module. This module helps the model better capture long-range dependencies and global context information, leading to improved performance on complex scenes.

  2. AlphaMatting: Introduced in 2021, AlphaMatting is a transformer-based model that leverages the self-attention mechanism to effectively capture long-range dependencies in images. It achieves impressive results, particularly in handling highly complicated backgrounds and foreground objects with intricate structures.

  3. SHM Matting: The Spatially-Hierarchical Matting (SHM) model, proposed in 2022, employs a hierarchical architecture that processes the input image at multiple spatial scales. This approach helps the model capture both fine-grained details and global structures, leading to improved accuracy, especially around object boundaries.

  4. BGMatting: Introduced in 2022, BGMatting (Background Matting) is a two-stage model that first predicts a coarse alpha matte and then refines it using a background estimation module. This unique approach helps the model better handle challenging cases with complex backgrounds or semi-transparent objects.

  5. HDMatt: The High-Definition Matting (HDMatt) model, introduced in 2022, is designed to produce high-resolution alpha mattes by leveraging a progressive upsampling strategy. It achieves impressive results, particularly for high-resolution images, while maintaining a relatively lightweight architecture.

These models represent some of the latest advancements in the field of image matting, each with its own unique architectural design and strengths. The choice of model often depends on factors such as the complexity of the scenes, the required level of detail, and the computational resources available.

Reference: https://github.com/ZHKKKe/MODNet

Let's get started!

2. Preparation

In the top menu of the Colab session, select Runtime -> Change runtime type, and set Hardware Accelerator to GPU.

Clone the repository, and download the pre-trained model:

First we import the os module, which provides functions for interacting with the operating system.

import os


# changes the current directory to /content.
# %cd is a Jupyter Notebook magic command used to change directories within the notebook.
%cd /content

# checks if a directory named MODNet exists in the current directory.
# If it doesn't exist, it clones the GitHub repository located at https://github.com/ZHKKKe/MODNet into a directory named MODNet.
if not os.path.exists('MODNet'):
  !git clone https://github.com/ZHKKKe/MODNet

# changes the current directory to the MODNet directory created or found in the previous step.
%cd MODNet/

# defines the path where the pre-trained checkpoint file will be saved or checked for
pretrained_ckpt = 'pretrained/modnet_photographic_portrait_matting.ckpt'

# checks if the file specified by pretrained_ckpt exists.
# If it doesn't exist, it proceeds with downloading the file.
if not os.path.exists(pretrained_ckpt):
# downloads the pre-trained checkpoint file from Google Drive using gdown.
# The file is saved in the specified path (pretrained/modnet_photographic_portrait_matting.ckpt).
# The --id flag specifies the ID of the file on Google Drive, and -O specifies the output filename.
  !gdown --id 1mcr7ALciuAsHCpLnrtG_eop5-EYhbCmz \
          -O pretrained/modnet_photographic_portrait_matting.ckpt
/content
Cloning into 'MODNet'...
remote: Enumerating objects: 276, done.
remote: Counting objects: 100% (276/276), done.
remote: Compressing objects: 100% (159/159), done.
remote: Total 276 (delta 105), reused 252 (delta 98), pack-reused 0
Receiving objects: 100% (276/276), 60.77 MiB | 37.53 MiB/s, done.
Resolving deltas: 100% (105/105), done.
/content/MODNet
/usr/local/lib/python3.10/dist-packages/gdown/cli.py:138: FutureWarning: Option `--id` was deprecated in version 4.3.1 and will be removed in 5.0. You don't need to pass it anymore to use a file ID.
  warnings.warn(
Downloading...
From: https://drive.google.com/uc?id=1mcr7ALciuAsHCpLnrtG_eop5-EYhbCmz
To: /content/MODNet/pretrained/modnet_photographic_portrait_matting.ckpt
100% 26.3M/26.3M [00:00<00:00, 64.5MB/s]

Now let's try this out.

3. Upload Images

Upload portrait images to be processed (only PNG and JPG formats are supported):

The following code ensures a clean slate by removing and recreating both input and output folders.

Users can then upload images, which are automatically moved into the input folder for processing.

shutil provides a higher-level interface for file operations, such as copying files and removing directories. google.colab.files provides utilities for interacting with files in a Google Colab environment, including uploading and downloading files.

import shutil
from google.colab import files

Sets up the input folder path where the images will be stored for processing. It checks if the input folder already exists. If it exists, it removes the entire folder and its contents (shutil.rmtree). Then, it creates a new, empty input folder (os.makedirs).

# clean and rebuild the image folders
input_folder = 'demo/image_matting/colab/input'
if os.path.exists(input_folder):
  shutil.rmtree(input_folder)
os.makedirs(input_folder)

Similar to the input folder, this block sets up the output folder path for storing processed images. It checks if the output folder already exists. If it exists, it removes the entire folder and its contents. Then, it creates a new, empty output folder.

output_folder = 'demo/image_matting/colab/output'
if os.path.exists(output_folder):
  shutil.rmtree(output_folder)
os.makedirs(output_folder)

This part allows the user to upload images into the Colab environment.

files.upload() prompts the user to select and upload files. It returns a dictionary where the keys are the uploaded file names and the values are the data.

list(files.upload().keys()) extracts the names of the uploaded files. A loop iterates through each uploaded image file: shutil.move(image_name, os.path.join(input_folder, image_name)) moves each uploaded image file from the current directory to the specified input folder. This step organizes the uploaded images into the input folder for further processing.

# upload images (PNG or JPG)
image_names = list(files.upload().keys())
for image_name in image_names:
  shutil.move(image_name, os.path.join(input_folder, image_name))
Saving 170891_00_2x.jpg to 170891_00_2x.jpg

4. Inference

The following code runs a Python script/module for image matting inference, specifying the input directory containing the images to be processed, the output directory where the processed images will be saved, and the path to the pre-trained model checkpoint file.

Run the following command for alpha matte prediction:

!python -m demo.image_matting.colab.inference \
        --input-path demo/image_matting/colab/input \
        --output-path demo/image_matting/colab/output \
        --ckpt-path ./pretrained/modnet_photographic_portrait_matting.ckpt
Process image: 170891_00_2x.jpg

Let’s break down what each part of the command does:

!python: a shell command that tells the system to run a Python interpreter.

-m demo.image_matting.colab.inference: the -m flag runs a module as a script, and demo.image_matting.colab.inference is the Python module to run; it contains the code for performing image matting inference.

--input-path demo/image_matting/colab/input: specifies the directory where the input images are stored.

--output-path demo/image_matting/colab/output: specifies the directory where the processed images will be saved.

--ckpt-path ./pretrained/modnet_photographic_portrait_matting.ckpt: specifies the path to the pre-trained model checkpoint file, relative to the current directory.
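Command-line interfaces like this are typically built with Python's argparse module. The following is only a sketch of how such flags might be parsed, not the actual MODNet inference code:

```python
import argparse

# Hypothetical parser mirroring the flags the inference command accepts.
parser = argparse.ArgumentParser(description="image matting inference (sketch)")
parser.add_argument("--input-path", required=True, help="directory of input images")
parser.add_argument("--output-path", required=True, help="directory for the output mattes")
parser.add_argument("--ckpt-path", required=True, help="path to the pre-trained checkpoint")

# Parse the same arguments used in the notebook cell above.
args = parser.parse_args([
    "--input-path", "demo/image_matting/colab/input",
    "--output-path", "demo/image_matting/colab/output",
    "--ckpt-path", "./pretrained/modnet_photographic_portrait_matting.ckpt",
])
print(args.input_path)  # demo/image_matting/colab/input
```

Note that argparse converts the dashes in flag names to underscores, so `--input-path` becomes `args.input_path`.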

5. Visualization

Display the results (from left to right: image, foreground, and alpha matte):

import numpy as np
from PIL import Image

The following function is useful for visualizing the process of image matting, where the foreground is extracted from the original image based on the provided matte.

from mergePicture import combined_display
import inspect

# Print the source code of the 'combined_display' function
print(inspect.getsource(combined_display))
def combined_display(image, matte):
    # calculate display resolution
    w, h = image.width, image.height
    rw, rh = 800, int(h * 800 / (3 * w))

    # obtain predicted foreground
    image = np.asarray(image)
    if len(image.shape) == 2:
        image = image[:, :, None]
    if image.shape[2] == 1:
        image = np.repeat(image, 3, axis=2)
    elif image.shape[2] == 4:
        image = image[:, :, 0:3]
    matte = np.repeat(np.asarray(matte)[:, :, None], 3, axis=2) / 255
    foreground = image * matte + np.full(image.shape, 255) * (1 - matte)

    # combine image, foreground, and alpha into one line
    combined = np.concatenate((image, foreground, matte * 255), axis=1)
    combined = Image.fromarray(np.uint8(combined)).resize((rw, rh))

    # extract the middle image
    middle_image = Image.fromarray(np.uint8(foreground))

    return combined, middle_image

The function, combined_display, takes an image and its corresponding matte (alpha channel) as inputs and returns two images: one for the combined display and the other for the middle image (foreground).

Here’s what each part of the function does:

  1. Calculate Display Resolution:
    • It calculates the display resolution for the output image.
    • w and h store the width and height of the input image, respectively.
    • rw is set to 800, indicating the desired width for the output image.
    • rh is calculated to preserve the aspect ratio of the combined three-panel strip, whose total width is 3 * w.
  2. Obtain Predicted Foreground:
    • Convert the input image and matte to NumPy arrays (image and matte).
    • Check if the input image is grayscale or has an alpha channel. If so, convert it to a 3-channel image.
    • Repeat the matte across channels and normalize it.
    • Calculate the predicted foreground by applying the matte to the input image.
  3. Combine Image, Foreground, and Matte:
    • Concatenate the input image, predicted foreground, and matte along the horizontal axis.
    • Convert the combined array back to an image (Image.fromarray) and resize it to the calculated resolution.
  4. Extract Middle Image:
    • Convert the predicted foreground array to an image (Image.fromarray) to extract the middle image.
  5. Return Output:
    • Return the combined display image and the middle image.

Here's the explanation of the return values:

  • combined: The combined image showing the original image, predicted foreground, and matte (alpha channel) concatenated horizontally.
  • middle_image: The image representing the predicted foreground, built from the foreground array.
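A quick worked example of the resolution math in step 1, using a hypothetical 1200x900 input:

```python
w, h = 1200, 900             # hypothetical input image size
rw = 800                     # target width of the three-panel strip
rh = int(h * 800 / (3 * w))  # height that preserves the strip's aspect ratio

print(rw, rh)  # 800 200
```

The factor of 3 appears because the combined strip places three w-wide panels side by side, so its true width is 3 * w.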

bg_dir = '/content/sample_data/Badlands.jpeg'

# Load the background image of Badlands National Park
background_image = Image.open(bg_dir)

This code segment iterates through all the images in the input folder, visualizes each image with its corresponding matte, and then displays the merged image where the middle image is composited onto a background based on the matte.

# visualize all images
image_names = os.listdir(input_folder)
for image_name in image_names:
    matte_name = image_name.split('.')[0] + '.png'
    image = Image.open(os.path.join(input_folder, image_name))
    matte = Image.open(os.path.join(output_folder, matte_name))
    combined, middle_image = combined_display(image, matte)

    # Display combined image
    display(combined)

    # Display merged
    merged = Image.composite(middle_image,background_image, matte)
    print(image_name, '\n')
    display(merged)

170891_00_2x.jpg 

As you can see, the first row of images corresponds to the original image, the extracted foreground, and the matte (alpha).

The second row is the image merged with the background.

Let’s break down what each part does:

Iterating Through Image Files:

image_names = os.listdir(input_folder)
for image_name in image_names:
  • This loop iterates through each file name in the input_folder, which contains the input images.

Obtaining Matte File Name:

    matte_name = image_name.split('.')[0] + '.png'
  • It extracts the file name of the matte corresponding to the current image by splitting the image file name at the ‘.’ character and appending ‘.png’ to it.
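One caveat with splitting on '.': it breaks for file names that contain more than one dot. os.path.splitext is a more robust way to swap the extension:

```python
import os

# splitext only strips the final extension, so multi-dot names survive intact.
for name in ["170891_00_2x.jpg", "photo.v2.jpg"]:
    matte_name = os.path.splitext(name)[0] + ".png"
    print(matte_name)
# 170891_00_2x.png
# photo.v2.png
```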

Opening Image and Matte:

    image = Image.open(os.path.join(input_folder, image_name))
    matte = Image.open(os.path.join(output_folder, matte_name))
  • It opens the input image and matte files using Image.open() from the PIL library, specifying their respective paths.

Visualizing Combined Image and Middle Image:

    combined, middle_image = combined_display(image, matte)

    # Display combined image
    display(combined)

    # Display middle image
    display(middle_image)
  • It calls the combined_display() function to create the combined image and extract the middle image (predicted foreground).
  • Then, it displays both the combined image and the middle image using display().

Merging Middle Image with Background:

    merged = Image.composite(middle_image, background_image, matte)
  • It composites the middle image with a background image using the alpha channel provided by the matte.

Printing Image Name:

    print(image_name, '\n')
  • It prints the name of the current image file.

Displaying Merged Image:

    display(merged)
  • It displays the merged image, which combines the middle image with a background using the provided matte.
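To see exactly how Image.composite uses the mask, here is a minimal two-pixel example (requires Pillow): where the mask is 255 the first image shows through, and where it is 0 the second image shows.

```python
from PIL import Image

fg = Image.new("RGB", (2, 1), (255, 0, 0))  # red "foreground"
bg = Image.new("RGB", (2, 1), (0, 0, 255))  # blue "background"
mask = Image.new("L", (2, 1))               # all-zero (transparent) mask
mask.putpixel((0, 0), 255)                  # left pixel fully opaque

merged = Image.composite(fg, bg, mask)
print(merged.getpixel((0, 0)), merged.getpixel((1, 0)))
# (255, 0, 0) (0, 0, 255)
```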

6. Implementation in Web

With all the functions we made, we want to bring this into our web application. Just like this:

image.png

How are we going to achieve this?

Web Development - Jamie

In the frontend component of our web application, we crafted a user interface that allows users to upload two distinct images: a background image and a foreground selfie. These inputs are transmitted to the backend server through an API call.

On the backend, our implementation uses Node.js together with a JavaScript library for subprocess management. With this architecture, we orchestrate the execution of the Python scripts that interface with our deep learning model. The model extracts the foreground from the selfie image and composites it onto the provided background image, producing a clean, high-fidelity integration of the extracted subject into the selected backdrop.

Here is some background knowledge you might need:

JavaScript is a high-level, interpreted programming language primarily used to create dynamic and interactive content on websites. Initially developed by Netscape as a client-side scripting language for web browsers, JavaScript has evolved into a versatile language that can be used for both client-side and server-side development.

Key features of JavaScript include:

  1. Client-Side Scripting: JavaScript is commonly used to add interactivity to web pages, such as responding to user actions like clicks, mouse movements, form submissions, and more. It can manipulate HTML elements, dynamically change styles, and modify content on the fly.

  2. Cross-Platform: JavaScript is supported by all modern web browsers, making it a cross-platform language. This means that code written in JavaScript will run consistently across different browsers and operating systems.

  3. Object-Oriented: JavaScript is an object-oriented language, allowing developers to create objects with properties and methods to represent real-world entities. Objects can be defined using classes or prototypes, and inheritance is supported through prototype chaining.

  4. Asynchronous Programming: JavaScript supports asynchronous programming using callback functions, promises, and async/await syntax. Asynchronous programming allows tasks to be executed concurrently without blocking the main thread, which is essential for handling I/O operations, such as fetching data from servers or interacting with databases.

  5. Functional Programming: JavaScript also supports functional programming paradigms, such as higher-order functions, closures, and anonymous functions. These features enable developers to write clean, concise, and reusable code.

  6. Server-Side Development: With the advent of server-side JavaScript frameworks like Node.js, JavaScript can now be used to build scalable and high-performance server-side applications. Node.js allows developers to run JavaScript code on the server, enabling full-stack development using a single programming language.

Overall, JavaScript is a versatile language that is widely used for web development, ranging from simple scripts to complex web applications. Its popularity and extensive ecosystem of libraries and frameworks make it an essential tool for modern web development.

Node.js is an open-source, cross-platform JavaScript runtime environment that allows developers to run JavaScript code outside of a web browser. It is built on the V8 JavaScript engine, which is the same engine that powers Google Chrome.

Node.js enables developers to write server-side applications in JavaScript, making it possible to use JavaScript for both client-side and server-side programming. This is advantageous because it allows for the reuse of code and skills across different parts of a web application.

Some key features of Node.js include:

  1. Asynchronous and event-driven: Node.js uses non-blocking, asynchronous I/O operations, which means it can handle many connections simultaneously without getting blocked. This makes it well-suited for building scalable and high-performance applications.

  2. Single-threaded: Node.js uses a single-threaded event loop architecture, which allows it to handle many concurrent connections efficiently. It achieves concurrency by delegating I/O operations to the operating system’s kernel, freeing up the main thread to handle other tasks.

  3. npm (Node Package Manager): npm is the default package manager for Node.js, providing a vast ecosystem of open-source libraries and tools that developers can use to build their applications.

  4. Wide range of use cases: Node.js is commonly used for building web servers, RESTful APIs, real-time applications (such as chat applications and online gaming), streaming applications, and more.

Overall, Node.js has become a popular choice for building server-side applications due to its performance, scalability, and the ease of using JavaScript for both client-side and server-side development.

This is the frontend file structure on GitHub: front.pic.jpg

This is the backend file structure: image.png

How to run the web application on your computer?

Step 1: Prepare the Node Environment

  1. Visit the official Node.js website at https://nodejs.org and download the installer suitable for your operating system (Windows, macOS, or Linux). Once downloaded, locate the installer file and execute it, following the on-screen instructions for installation.

  2. To ensure that Node.js has been installed correctly, open a terminal and execute the command node -v. This command will display the installed version of Node.js on your system.

Step 2: Start the Backend Server

Navigate to the /backend-main folder in a terminal session.

  1. Create a .env file within this directory containing the MongoDB USERNAME and PASSWORD required for database connectivity (substitute your own credentials; avoid committing real ones):
USERNAME=<your-mongodb-username>
PASSWORD=<your-mongodb-password>
  2. Execute the command npm install to install the project dependencies.

  3. Run the command npm run dev to start the backend server, establishing a connection to the MongoDB server and enabling it to listen for incoming requests from the frontend.

Step 3: Start the Frontend Server

Navigate to the /frontend-main folder in another terminal session.

  1. Execute the command npm install to install the project dependencies.

  2. Run the command npm run dev to initiate the frontend server.

  3. Note: As the frontend and backend servers typically run on different ports locally, and due to browser security policies that may block cross-origin requests, we utilize a middleware to circumvent this limitation. If your frontend project is not running on port 5174, please adjust the port number in line 13 of the /backend-main/server.js file accordingly.

  4. Upon successful execution, a link will appear in your terminal indicating the URL for accessing the frontend. Click on this link to interact with the website.

By following these steps, you can set up the Node.js environment, start the backend server, initiate the frontend server and play with our website.

Now you can see what the main page looks like with filters:

image.png

When you click the login it will show:

image.png

If you are a new user, it will show: image.png

And of course there is the Photoshop page you saw earlier.

What technology stack did we use in our web application?

  1. Data Collection: Utilizing Python with Selenium, we scrape data from websites like AllTrails and TripAdvisor, leveraging their HTML DOM structure to extract the necessary information. This data is then stored in CSV format.

  2. Data Analysis and Processing: After collecting the data, we analyze and process it to create structured JSON-formatted data. This processed data is crucial for further operations.

  3. Database Integration: Using Mongoose, we seamlessly integrate the processed data into MongoDB. This allows for efficient storage and retrieval of data for future use.

  4. Frontend Development: Employing Vue.js framework, we design a user-friendly UI/UX to deliver a streamlined experience. Users can effortlessly browse and favorite trails, while we utilize their browsing history and favorite trails’ characteristics to provide personalized recommendations.

  5. Innovative Feature Integration: We incorporate an intriguing feature where users can upload their selfies. Leveraging deep learning techniques, we extract the human body parts from the images. Users can then select from a range of scenic images provided, and seamlessly paste their extracted images onto these backgrounds, creating personalized compositions.

  6. Backend Implementation: Using Node.js and Express.js along with MongoDB, we architect robust APIs to serve the required data to the frontend. This backend infrastructure ensures smooth communication between the frontend and database, enabling seamless functionality for the users.

Ethical Ramifications and Concluding Remarks

We have no control over what users do with the recommendations that they receive. This could mean, for example, that they engage in malicious behavior on certain trails, or that they attempt to monetize the results from our tool despite the fact that we’d like it to remain freely available. Even if we include disclaimers, warnings, or agreements that users have to abide by, once a recommendation is generated, it’s out of our hands.

As for biases, language processing tools may be attuned only to certain linguistic conventions, which may prioritize the results from certain reviews over others. We don’t have prior experience using NLP, so this is something that we anticipate having to address as we go along. See “Risks” above.

All in all, we think that we executed the technical portion of our project well. We gave users the trails most similar to their input based on reviews, and supplemented this information by merging those similar trails with the numeric data from TrailForks. We could improve by deploying this as a proper website, so that users do not have to open our Jupyter notebooks and manually input the trails they want.